A Computational Analysis of Collective Discourse
ABSTRACT
This paper presents a computational analysis of collective discourse, a collective behavior seen in non-expert content contributions in online social media. We collect and analyze a wide range of real-world collective discourse datasets, from movie user reviews and microblogs to news headlines and scientific citations. We show that all of these datasets exhibit diversity of perspective, a property seen in other collective systems and a criterion for wise crowds. Our experiments also confirm that the network of perspective co-occurrences exhibits the small-world property, with high clustering of different perspectives. Finally, we show that non-expert contributions in collective discourse can be used to answer simple questions that are otherwise hard to answer.

INTRODUCTION

Collective behavior refers to social processes that are not centrally coordinated and emerge spontaneously (Blumer 1951). This definition distinguishes collective behavior from group behavior in a number of ways: (a) collective systems involve limited social interactions, (b) membership is fluid, and (c) collective behavior generates weak and unconventional norms (Smelser 1963).

Collective behavior is normally characterized as a complex system (Miller & Page 2007). A complex system is a system composed of interconnected parts (agents, processes, etc.) that as a whole exhibits one or more properties called emergent behavior. The emergent behavior, which is not obvious from the properties of the individuals, is said to be nonlinear (not derivable by summing the activity of the individual components). Nonlinear behavior has been widely observed in nature. Gordon (1999) explains how harvester ants achieve task allocation without any central control, solely by means of continual adjustment. Moreover, she argues that cooperative behavior in the ant colony results merely from local interactions between individual ants and not from a central controller.
For instance, in ant colonies individual members react to local stimuli (in the form of chemical scent) depending only on their local environment. In the absence of a centralized decision maker, ant colonies exhibit complex behavior to solve geometric problems such as finding the shortest paths to food or disposing of dead bodies at the maximum distance from all colony entrances. Self-organized behavior is not specific to ants. Schools of fish, flocks of birds, and herds of ungulate mammals are other examples of complex systems among animal groups (Fisher 2009). Similarly, pedestrians on a crowded sidewalk exhibit self-organization that leads to the formation of lanes along which walkers move in the same direction (Boccara 2010).

It is argued that all examples of complex systems exhibit common characteristics:

1. They are composed of a large number of interconnected parts (i.e., agents).
2. The system is self-organized in that there is no central controller.
3. They exhibit emergent behavior: properties seen in the group but not observable in the individuals.

In the social sciences, much work has been done on collective systems and their properties (Hong & Page 2009). However, little work has studied collective systems in which individual members collectively describe an event or an object. In our work, we focus on the computational analysis of collective discourse, a collective behavior seen in interactive content contribution in online social media (Qazvinian & Radev 2011). In this paper, we show that collective discourse exhibits diversity of opinions, a property that Surowiecki (2004) defines as a necessary criterion for wise crowds.

BACKGROUND

Previously, it has been argued that diversity is essential in intelligent collective decision-making. Page (2007) argues that the diversity of people and groups, which enables new perspectives, leads to better decision making.
He finds that the diversity of perspectives in a collective system is associated with higher rates of innovation and can enhance the capacity for finding solutions to complex problems. Similarly, Hong & Page (2004) show that a random group of intelligent problem solvers can benefit from diversity and outperform a group of the best problem solvers.

PROCEEDINGS, CI 2012

Prior work has also studied the diversity of perspectives in content contribution and text summarization. In work on evaluating independent contributions in content generation, Voorhees (1998) studied IR systems and showed that relevance judgments vary significantly between humans but relative rankings are more stable across annotators. Similarly, van Halteren & Teufel (2004) designed an experiment that asked 40 Dutch students and 10 NLP researchers to summarize a BBC news report, resulting in 50 different summaries. They calculated the Kappa statistic (Carletta 1996, Krippendorff 1980) and observed high inter-judge agreement, suggesting that the task of atomic semantic unit (factoid) extraction can be performed robustly on naturally occurring text.

The diversity of perspectives and the unprecedented growth of the factoid inventory have influenced other research areas in Natural Language Processing, such as text summarization and paraphrase generation. Summarization evaluations are performed by assessing the information content of automatically generated summaries with respect to salience and diversity (Spärck Jones 1999, van Halteren & Teufel 2003, Nenkova & Passonneau 2004). Leveraging the diverse range of perspectives has also played a critical role in developing new paraphrase generation systems by providing massive amounts of data that are easy to collect. For instance, Chen & Dolan (2011) collected highly parallel data for training paraphrase generation systems from descriptions that participants wrote for video segments from YouTube.
Such parallel corpora of document pairs that represent the same semantic information in different languages can be extracted from user contributions in Wikipedia and used for learning translations of words and phrases (Yih, Toutanova, Platt & Meek 2011).

COLLECTIVE DISCOURSE

With the growth of Web 2.0, millions of individuals are involved in collective discourse. They participate in online discussions, share their opinions, and generate content about the same artifacts, objects, and news events on Web portals such as amazon.com, epinions.com, and imdb.com. This massive amount of text is written mainly by non-expert individuals with different perspectives, and yet exhibits accurate knowledge as a whole.

In social media, collective discourse is often a collective reaction to an event. A collective reaction to a well-defined subject emerges in response to an event (a movie release, a breaking story, a newly published paper) in the form of independent writings (movie reviews, news headlines, citation sentences) by many individuals. To analyze collective discourse, we work with a wide range of real-world datasets.

Corpus Construction

An essential step and an important contribution of our work is gathering a comprehensive corpus of datasets on collective discourse. We focus on social media consisting of independent contributions of many individuals. Furthermore, we focus on topics corresponding to specific items and events, as opposed to issues that are evolving and diffuse in either time or scope, such as the economy or education. Table 1 lists the collective discourse corpora that we have analyzed, along with the number of datasets (clusters) and the average number of documents in each.

Dataset          #clusters   average #docs
Movie reviews    100         965
Microblogs       15          110
News headlines   25          55
Citations        25          52

Table 1. Number and size of collective discourse datasets studied in this paper.
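The figures in Table 1 can be cross-checked with simple arithmetic: multiplying the number of clusters by the average number of documents per cluster gives a rough corpus size. The sketch below does this in Python. The names `TABLE_1` and `approximate_totals` are hypothetical, introduced only for illustration, and because the averages in the table are presumably rounded, the products are estimates rather than exact counts.

```python
# Rows of Table 1: (dataset, number of clusters, average documents per cluster).
# The values are copied from the table; the variable name is illustrative only.
TABLE_1 = [
    ("Movie reviews", 100, 965),
    ("Microblogs", 15, 110),
    ("News headlines", 25, 55),
    ("Citations", 25, 52),
]


def approximate_totals(rows):
    """Return {dataset: clusters * average docs}, a rough estimate of corpus size."""
    return {name: clusters * avg_docs for name, clusters, avg_docs in rows}


totals = approximate_totals(TABLE_1)
for name, total in totals.items():
    print(f"{name}: about {total} documents")
```

For the movie reviews, the product of 100 clusters and an average of 965 reviews is 96,500, which agrees with the figure of more than 96,500 user reviews reported for that corpus.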
In the following, we further explain each of these collective discourse corpora.

Movie Reviews

The first collective discourse that we analyze is the set of reviews that non-expert users write about a movie. The set of online reviews about an object is a perfect case of collective human behavior. Upon its release, each movie, book, or product receives hundreds and thousands of online reviews from non-expert Web users. These reviews, while discussing the same object, focus on different aspects of it. For instance, some movie reviewers focus solely on a few famous actors, while others discuss aspects such as music or screenplay.

To study collective discourse in movie reviews, we collected all the user reviews for 100 movies randomly selected from the top 250 movies list in the Internet Movie Database (IMDB; http://www.imdb.com/chart/top). For each of these 100 movies, we also obtained the plot keywords provided on the IMDB website. Our collected corpora consist of more than 96,500 user reviews posted for movies from 19 different genres. The following excerpts are extracted from user reviews for the movie Pulp Fiction, and show how non-expert reviewers focus on different aspects of the movie.

“... starred by many well-known Actors, such as: John Travolta, Samuel L. Jackson, Uma Thurman, Bruce Willis and many. Directed by Quentin Tarantino, the eccentric Director ...”

“... Pulp fiction was nominated for seven academy awards and won only one for screen writing ...”

“... Shocking, intelligent, exciting, hilarious and oddly though-provoking. Best bit: Jackson’s Bible quote ...”

Microblogs

The second type of collective discourse that we study is the set of tweets written about a news story. Despite its advantages, using Twitter as a corpus of collective discourse presents unusual challenges.
In Twitter, posts are limited to 140 characters and often contain information in an unusually compressed form.

First, we use the set of tweets collected by Qazvinian, Rosengren, Radev & Mei (2011) about Sarah Palin’s divorce rumor that was popular during the 2008 presidential election campaigns. This dataset contains tweets that are all about this story and yet discuss it from different angles. For example, the following tweets are extracted from this dataset and reveal various facts about the story. One aspect is that a blogger started spreading the rumor and was threatened with a libel suit. Another aspect is that the rumor was debunked on Facebook.

“Palins lawyer threatens divorce blogger with libel suit, gives her the option of receiving the summons at her resid... http://ow.ly/15JDO6.”

“@jose3030 Palin divorce is supposedly debunked on Facebook, but I think they are just spinning it, until they can announce it.”

“RT @mediaite: Sarah Palin uses Facebook to deny unsourced divorce rumors http://bit.ly/14Xy6h CH.”

As our second microblog dataset, we collected the tweets that discuss the cancellation rumors of 14 TV shows in August 2011. For instance, one of our collected datasets is about the rumor that Charlie Sheen might return to the TV show Two and a Half Men.

“Charlie Sheen Claims ’Discussions’ About Returning to ’Two and a Half Men’: In Boston for his national tour, C... http://bit.ly/hIbOWf.”

“Charlie Sheen “Two And A Half Men” Return Not Happening: Report http://dlvr.it/LCTkd.”

News Headlines

Another form of collective discourse is seen when a story breaks and various news agencies write about it. These articles all cover the same story but view it from different perspectives. We collected 25 news clusters from Google News. Each cluster consists of a set of unique headlines about the same story, written by different sources. The following example shows three headlines from our datasets that are about Hurricane Bill and its damage in Maine.
“Hurricane Bill sweeps several people into ocean.”

“7-year-old girl swept away by Bill wave dies after rescue.”

“Maine ranger: wave viewers didn’t heed warnings.”
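The three Hurricane Bill headlines describe one story yet share very few words. One rough way to make that visible is the Jaccard overlap between the word sets of each pair of headlines. The sketch below is an illustration only: word-level Jaccard similarity is an assumption chosen for simplicity and is not presented here as the measure used in the paper, and the names `tokenize`, `jaccard`, and `pairwise_overlap` are hypothetical.

```python
import re
from itertools import combinations

# The three example headlines about Hurricane Bill quoted above.
HEADLINES = [
    "Hurricane Bill sweeps several people into ocean.",
    "7-year-old girl swept away by Bill wave dies after rescue.",
    "Maine ranger: wave viewers didn't heed warnings.",
]


def tokenize(text):
    """Lowercase a headline and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(texts):
    """Map each index pair (i, j), i < j, to the Jaccard overlap of texts i and j."""
    token_sets = [tokenize(t) for t in texts]
    return {
        (i, j): jaccard(token_sets[i], token_sets[j])
        for i, j in combinations(range(len(texts)), 2)
    }


overlaps = pairwise_overlap(HEADLINES)
for (i, j), score in overlaps.items():
    print(f"headlines {i} and {j}: Jaccard overlap {score:.3f}")
```

Under this crude measure the first and third headlines have no words in common, and each of the other two pairs shares a single word ("bill" or "wave"). Surface-level overlap of this kind cannot capture perspectives directly, so it should be read only as a quick sanity check of how differently the same event is worded.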
Journal: CoRR
Volume: abs/1204.3498
Publication date: 2012